CEN@Amrita: Information Retrieval on CodeMixed Hindi-English Tweets Using Vector Space Models
Authors
Abstract
Information retrieval from social media platforms is a major challenge today. Most of the content on these platforms is informal and noisy, which makes retrieval harder. The task is even more difficult for Twitter because of its per-tweet character limit, which forces users to express themselves in a condensed set of words. In the Indian context the situation is more complicated still: users prefer to write in their mother tongue, but the lack of input tools forces them to use Roman script with embedded English words. This mixture of multiple languages written in Roman script makes information retrieval even harder. Query processing for such code-mixed content is difficult because a query can be in either language and must be matched against documents written in either language. In this work, we address this problem using Vector Space Models, which gave significantly better results than those of most other participants. The Mean Average Precision (MAP) of our system was 0.0315, the second-best performance for the subtask.
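As an illustration of the two ideas named in the abstract, the following is a minimal sketch of vector space retrieval and of the MAP metric. It is not the system described in the paper: the whitespace tokenizer, the TF-IDF weighting with `log(N/df) + 1`, the cosine ranking, and all function names are assumptions chosen for brevity, since the abstract does not specify them.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens. Code-mixed tweets would
    # likely need more careful normalization than this.
    return text.lower().split()


def build_index(docs):
    """Return (TF-IDF vector per document, IDF table) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumed weighting: idf = log(N / df) + 1.
    idf = {t: math.log(n / d) + 1.0 for t, d in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank(query, vectors, idf):
    """Return document indices sorted by decreasing cosine similarity."""
    # Query terms never seen in the collection are ignored.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    # Ties are broken by document index to keep the order deterministic.
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


def average_precision(ranking, relevant):
    """Average of precision@k over the ranks k where a relevant doc appears."""
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, relevants):
    """Mean of average precision over all queries."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps) if aps else 0.0
```

In this sketch a query and a document are comparable regardless of language as long as they share Romanized tokens, which is the property that makes a vector space approach a natural fit for code-mixed text. MAP rewards placing relevant documents near the top of each ranking, so a low value such as the one reported above indicates that relevant tweets were rarely ranked early.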
Similar resources
CEN@Amrita FIRE 2016: Context based Character Embeddings for Entity Extraction in Code-Mixed Text
This paper presents the working methodology and results for the Code Mix Entity Extraction in Indian Languages (CMEE-IL) shared task of FIRE-2016. The aim of the task is to identify various entities such as person, organization, movie, and location names in given code-mixed tweets. The code-mixed tweets are written in English mixed with Hindi or Tamil. In this work, Entity Extraction system ...
Sentence Boundary Detection for Social Media Text
The paper presents a study on automatic sentence boundary detection in social media texts such as Facebook messages and Twitter micro-blogs (tweets). We explore the limitations of using existing rule-based sentence boundary detection systems on social media text, and as an alternative investigate applying three machine learning algorithms (Conditional Random Fields, Naïve Bayes, and Sequential ...
NLP CEN AMRITA @ SMM4H: Health Care Text Classification through Class Embeddings
Artificial Intelligence has been a major breakthrough in many domains. It has now started automating the health care domain through Natural Language Processing and Computer Vision applications. As part of this, researchers are focusing more on mining health-related information from text shared through social media and clinical trials. This paper explains about our system for health care te...
Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis)
The FIRE 2017 Information Retrieval from Microblogs during Disasters (IRMiDis) track focused on retrieval and matching of needs and availabilities of resources from microblogs posted on Twitter during disaster events. A dataset of around 67,000 microblogs (tweets) in English as well as in local languages such as Hindi and Nepali, posted during the Nepal earthquake in April 2015, was made availa...
Improved Skips for Faster Postings List Intersection
Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function, and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...
Publication date: 2016